Abstract: State-of-the-art approaches to table recognition compete
on the 67 annotated government reports in PDF format released by
the ICDAR 2013 Table Competition. In contrast, this paper
contributes a novel paradigm that leverages large-scale unlabeled
PDF documents for open-domain table detection. We integrate the
paradigm into our newly developed system (PdfExtra), which detects
the region of tables using 9,466 academic articles from the entire
repository of the ACL Anthology, where almost all papers are
archived in PDF format without annotation for tables. The paradigm
first designs heuristics to automatically construct weakly labeled
data. It then feeds diverse evidence, such as document layouts and
linguistic features, which are extracted by Apache PDFBox and
processed by the Stanford NLP toolkit, into different canonical
classifiers. Finally, these classifiers, i.e. Naive Bayes, Logistic
Regression and Support Vector Machine, collaboratively vote on the
region of tables. Experimental results show that PdfExtra
substantially outperforms the state-of-the-art approach. Moreover,
we discuss how different features, learning models and even
document domains may affect the performance. Extensive evaluations
demonstrate that our paradigm is flexible enough to leverage
various features and learning models for open-domain table region
detection within PDF files.

1. Introduction 

Tables are primarily used to present data such as the results of
statistical analysis, experimental records and attributes of
items. The grid structure of a table, with its columns and rows,
allows a reader to easily interpret and compare different items.
Owing to these advantages, tables have been widely adopted in many
kinds of documents, such as web pages, academic publications and
online manuals. Researchers in information extraction are keen to
work with the tables that occur in these electronic documents, as
they are natural sources for populating relational databases.

Some formats of electronic documents are machine-readable, such as
HTML, XML and even TeX. HTML and XML derive from SGML (Standard
Generalized Markup Language), and all of these formats follow the
basic principle that a pair of specific tags marks a snippet of
text. For example, HTML files use <table> as the start and </table>
as the end to indicate the region of a table. Programs can easily
recognize the expected regions with the help of such tags, and
extract the desired information with pre-defined actions. However,
it is tedious for human beings to read markup languages, because we
are sensitive to the layout of documents and focus more on the
contents. Therefore, the Portable Document Format (PDF) was
designed as a file format that represents a document independently
of the platform on which it is displayed, and that preserves the
layout both on screen and in print. These strengths have drawn much
attention from online publishing. So far, many academic papers and
manuals have adopted PDF as the standard format.

Unfortunately, detecting the region of tables within PDF files is
very difficult, due to the lack of structural information. To the
best of our knowledge, the latest off-the-shelf software, Apache
PDFBox¹, can only provide the coordinates (x, y) and the font style
of each character in a PDF document. As table region detection is
the fundamental and significant step for further information
extraction from PDF files, many approaches have been proposed in
recent decades. However, they either simply design heuristic rules
based on pre-defined layouts, or adopt supervised learning
techniques fed by small annotated corpora from restricted domains.
For instance, ICDAR 2013 set up a competition on table detection
and structure recognition within 67 annotated PDF documents
published by U.S. and E.U. governments, where each document is
accompanied by an XML file that indicates the location of tables.

When we further apply these methods to some free access digital
academic archives, such as IEEE Xplore and Springer Link, the
variety of layouts and the explosive amount of unannotated data
expose an urgent demand for unsupervised or semi-supervised
frameworks. With such frameworks, we do not have to spend much
labor on annotation, but can leverage large-scale unlabeled PDF
files. To the best of our knowledge, Klampfl et al. [1] have
recently proposed unsupervised table recognition methods for
digital scientific articles. However, their work was purely based
on heuristic rules and was evaluated on only 109 files in total. We
consider it not flexible enough to handle more PDF articles with
variable layouts.

1. https://pdfbox.apache.org/ 



Therefore, we are the first to propose a solution that requires
very little human effort to detect a table region in PDF.
Specifically, our approach reduces the cost of training data
annotation by automatically generating the annotated data using a
distant supervision technique. The approach first collects a large
unlabeled PDF dataset, uses simple heuristic rules to automatically
annotate it, and then trains a supervised classifier over the
(weakly labeled) training examples to predict the boundary of the
table region. The human effort is almost negligible in our approach
because the unlabeled PDF dataset can be easily acquired from the
web and the data annotation is performed automatically.

Our experimental results confirm the promise of our approach. To
evaluate it, we collected 9,466 PDF files from the ACL Anthology²
and developed a simple heuristic rule to automatically generate
training examples from them. Then, a supervised classifier (an
ensemble of Logistic Regression, Support Vector Machine and Naive
Bayes) was trained on the weakly labeled dataset and applied to
test datasets from different domains. Our evaluation shows that,
first, our approach significantly outperforms a state-of-the-art
algorithm [1] on the ACL test dataset. Furthermore, even on an
out-of-domain test dataset (the ICDAR 2013 competition dataset),
our system achieves a significantly higher accuracy than the
baseline system, indicating the effectiveness of the large amount
of training data automatically generated by our approach. We also
performed additional experiments to analyze which features are
important for detecting a table region and to compare different
classifiers, and we report the results of those experiments.

2. Related Work 

A comprehensive review can be found in the final report of the
ICDAR 2013 Table Competition, which announced the performance of
recent academic and commercial systems on either table region
detection or table structure analysis. Here we restrict our survey
to a number of recent methods that attempt to discover table
regions within PDF files.

The first effort was the pdf2table³ system, which used heuristics
to detect the table region. It assumed that a table had more than
one column, and a table region was formed by merging neighboring
multi-lines. However, the algorithm could only handle pages with
single-column layouts.

The PDF-TREX system considered a set of words as seeds first, and
identified tables in a bottom-up manner. Specifically, words were
aligned and grouped into lines based on their vertical overlap, and
line segments were obtained by applying hierarchical agglomerative
clustering to the words. According to the number of segments, a
line was categorized into one of three classes: text lines, table
lines and unknown lines. Then, the table region could be found by
combining contiguous table lines or unknown lines.

2. http://aclweb.org/anthology/ 

3. http://ieg.ifs.tuwien.ac.at/projects/pdf2table/ 


Supervised classification models were mainly adopted by Liu et
al., who designed a table detection method that leveraged
heuristics to construct lines from individual characters and to
select those sparse lines that occur within a table for training.
Starting from a table caption, these sparse lines were then
iteratively merged into a table region. This approach is very
similar to the state-of-the-art unsupervised method [1] and ours,
except that it was built upon labeled text blocks instead of lines.

The most recent approach [1] did not rely on annotated data, but
used complex heuristics to achieve performance comparable with
supervised systems. Our system PdfExtra requires no labeled data,
yet covers large-scale PDF files with various layouts. Therefore,
we mainly compare the performance of our system with the
state-of-the-art unsupervised method [1].

3. Paradigm 

PdfExtra benefits greatly from the off-the-shelf software Apache
PDFBox, which can recognize almost all characters within a PDF
document. Beyond the characters, the software also provides the
horizontal and vertical coordinates, as well as the font style, of
each of them. Thus each “rich character” can be represented as a
tuple: (character, x-axis, y-axis, font-type, font-size). In
addition, Apache PDFBox can merge characters into words, and return
in sequence the words that visually lie on the same line. However,
it offers nothing further for discovering tables. Therefore, we
take the outputs of Apache PDFBox and predict whether each line
belongs to a table or not.

Although we have formulated the table region detection task as a
binary classification problem, we still suffer from the lack of
annotated training data. As illustrated in Figure 1, the paradigm
that we have designed to address this issue contains three phases:

3.1. Heuristic annotation 

Inspired by the idea of distant supervision, we adopt heuristics
that help automatically generate large-scale weakly labeled
training examples. More specifically, we create a spider that
downloads academic articles from the ACL Anthology⁴, in which
almost all papers are archived in PDF format. In total, 9,466
articles published from 2000 to 2015 are collected. For each PDF
article, we process each page in three steps as follows:

• Indicator Recognition: As all camera-ready drafts must conform
to a limited number of official templates to be published, the word
“Table” or “Tab.” at the beginning of a line generally indicates
the caption of a table. In other words, we find the lower or the
upper boundary of the table region, depending on the template.

• Surrounding Contexts: The caption line separates the table from
the main body. Because we do not know which side belongs to the
table, we extend k lines up and down from the caption as the
candidate context.

• Positive vs. Negative Examples: After extracting these
candidates, we assume that the group of lines with more
blanks/margins is more likely to lie in a table than the other
group. In this way, we can construct a balanced corpus for binary
classification, even though it is only weakly annotated by the
heuristics above.

4. http://aclweb.org/anthology/

Figure 1. The proposed paradigm adopted by the PdfExtra system.

With the heuristics we have proposed, a large-scale weakly labeled
dataset can be automatically constructed for training. For
instance, the rules yield more than 350,000 lines extracted from
the ACL Anthology as training examples. As each line is a sequence
of words, in which every “rich character” carries its coordinates
and font style, we can further process each word to mark its start
and end coordinates in the horizontal direction.
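The three annotation steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the PdfExtra implementation: a line is assumed to be a pair of its text and the (start_x, end_x) spans of its words, the caption pattern `CAPTION`, the window size `k` and the helper names are all hypothetical, and "more blanks/margins" is interpreted here as a larger mean gap between consecutive words.

```python
import re
from typing import List, Tuple

# Assumed representation: (text, word_spans); each span is the
# (start_x, end_x) of one word on the line.
Line = Tuple[str, List[Tuple[float, float]]]

# Assumed caption pattern: a line that starts with "Table" or "Tab.".
CAPTION = re.compile(r"^\s*(Table|Tab\.)\s")


def average_gap(spans: List[Tuple[float, float]]) -> float:
    """Mean horizontal gap between consecutive words (0.0 if < 2 words)."""
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(spans, spans[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0


def group_gap(page: List[Line], indices: List[int]) -> float:
    """Mean of the per-line average gaps over a group of lines."""
    return sum(average_gap(page[j][1]) for j in indices) / len(indices)


def weak_labels(page: List[Line], k: int = 3) -> List[Tuple[int, int]]:
    """Return (line_index, label) pairs: +1 = table line, -1 = body line."""
    labeled = []
    for i, (text, _) in enumerate(page):
        if not CAPTION.match(text):
            continue  # step 1: only caption lines act as indicators
        # Step 2: take up to k lines above and below the caption.
        above = list(range(max(0, i - k), i))
        below = list(range(i + 1, min(len(page), i + 1 + k)))
        if not above or not below:
            continue  # both groups are needed for the comparison
        # Step 3: the group with the larger gaps is labeled as the table.
        if group_gap(page, above) > group_gap(page, below):
            table, body = above, below
        else:
            table, body = below, above
        labeled += [(j, +1) for j in table] + [(j, -1) for j in body]
    return labeled
```

Because each caption contributes one group of positive lines and one group of negative lines of roughly equal size, the resulting corpus is close to balanced. A body sentence that happens to start with "Table" would also trigger this simple pattern, which is one source of the label noise inherent in weak annotation.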

3.2. Feature identification 

The state-of-the-art approach [1] only considers the layout of a
PDF document. It iteratively includes a sparse block into a table
in a bottom-up manner, where a block is identified as “sparse” if
1) its width is smaller than a fixed fraction of the average width
of a text block, or 2) there exists a gap between two consecutive
words in the block that is larger than two times the average width
between two words in the document.
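The sparse-block test of the baseline can be written as a small predicate. This is a sketch of the rule as described above, not code from [1]: the function and parameter names are hypothetical, and the width threshold is exposed as `width_ratio`, whose default here is an arbitrary placeholder rather than the value used by the baseline.

```python
from typing import Sequence


def is_sparse(block_width: float,
              word_gaps: Sequence[float],
              avg_block_width: float,
              avg_word_gap: float,
              width_ratio: float = 0.5,   # placeholder, not the value in [1]
              gap_factor: float = 2.0) -> bool:
    """Return True if a text block counts as sparse.

    Condition 1: the block is narrower than width_ratio times the
    average text block width.
    Condition 2: some gap between consecutive words exceeds gap_factor
    times the document-wide average gap between two words.
    """
    narrow = block_width < width_ratio * avg_block_width
    wide_gap = any(gap > gap_factor * avg_word_gap for gap in word_gaps)
    return narrow or wide_gap
```

Both conditions rely only on geometry, which is why this rule is sensitive to the page template.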

However, we believe that both linguistic and layout features are
significant. Therefore, based on our observations, we select three
kinds of features that may contribute to detecting the region of
tables:

• Normalized Average Margin (NAM): From the horizontal coordinates
of the words in a line, we calculate the average margin between
two consecutive words and assign it to the line as a feature. In
most cases, the average margin between two consecutive words in
the main body equals the width of a space, while that in tables is
usually larger. However, the average margin varies with the layout,
and one-column layouts generally produce much larger margins than
two-column formats. Therefore, we normalize the average margin
within the same page and use it as the feature that represents the
layout.

• POS Tag Distribution (PTD): It is commonly agreed that tables
display information in a more structured and condensed way than
the flowery language of sentences in the main body of an article.
Intuitively, more noun phrases appear in tables, but fewer
adjectives and adverbs are used. This distinction leads to
different distributions of part-of-speech (POS) tags, which we use
as the second feature. We consider 5 kinds of part-of-speech tags
produced by the Stanford POS Tagger⁵: NN, VB, JJ, RB and OTHERS.

• Named Entity Percentage (NEP): We extend the traditional scope
of named entities to include numbers and times. Therefore, 5 kinds
of named entity tags, i.e. PERSON, LOCATION, ORGANIZATION, NUMBER
and TIME, are recognized by the Stanford Named Entity Recognizer⁶.
For each kind of named entity, we compute its percentage in each
line.
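The three features can be computed along the following lines. This is a minimal sketch under assumptions: the tags are taken as already produced by external taggers, fine-grained POS tags are collapsed onto the five classes by prefix, and the page-level normalization is assumed to be division by the largest average margin on the page, since the exact normalization is not spelled out above. All function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

POS_CLASSES = ("NN", "VB", "JJ", "RB", "OTHERS")
NE_CLASSES = ("PERSON", "LOCATION", "ORGANIZATION", "NUMBER", "TIME")

Span = Tuple[float, float]  # (start_x, end_x) of one word


def average_margin(spans: Sequence[Span]) -> float:
    """Mean gap between consecutive words of one line."""
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(spans, spans[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0


def normalized_average_margins(page: Sequence[Sequence[Span]]) -> List[float]:
    """NAM for every line of a page (assumed: divide by the page maximum)."""
    margins = [average_margin(line) for line in page]
    largest = max(margins, default=0.0)
    return [m / largest if largest > 0 else 0.0 for m in margins]


def pos_distribution(pos_tags: Sequence[str]) -> List[float]:
    """PTD: share of NN, VB, JJ, RB and OTHERS tags in one line."""
    counts = Counter()
    for tag in pos_tags:
        # Assumed mapping: e.g. NNS -> NN, VBD -> VB; anything else -> OTHERS.
        key = next((c for c in POS_CLASSES[:4] if tag.startswith(c)), "OTHERS")
        counts[key] += 1
    total = len(pos_tags)
    return [counts[c] / total if total else 0.0 for c in POS_CLASSES]


def entity_percentages(ne_tags: Sequence[str]) -> List[float]:
    """NEP: share of tokens carrying each of the five entity tags."""
    counts = Counter(ne_tags)
    total = len(ne_tags)
    return [counts[c] / total if total else 0.0 for c in NE_CLASSES]
```

For a table row such as "Train 9280 357892", most tokens are numbers, so the NUMBER entry of the NEP vector is high while the JJ and RB shares of the PTD vector stay at zero, which is the contrast with body text that the features are meant to capture.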


3.3. Region detection 


Suppose that we have n examples in the weakly labeled training
dataset. Each example is a line of “rich characters” mapped into a
feature vector x along with its weak label y. We further use
$x_{NAM}$, $x_{PTD}$ and $x_{NEP}$ to denote the three features,
i.e. Normalized Average Margin (NAM), POS Tag Distribution (PTD)
and Named Entity Percentage (NEP), respectively. Hence, each
training example can be represented as $(x^{(i)}, y^{(i)})$, in
which $x^{(i)} = (x^{(i)}_{NAM}, x^{(i)}_{PTD}, x^{(i)}_{NEP})$ and
$(i)$ denotes the index.

Here we use three canonical classifiers, i.e. Logistic Regression,
Support Vector Machine and Naive Bayes, fed with the training
examples above, to decide whether a line of “rich characters”
provided by Apache PDFBox belongs to a table or not. We explain
below how we model the classifiers based on the features and the
weak labels from the corpora we have constructed:

• Logistic Regression⁷ (LR) assumes that we can score the i-th
example to indicate whether it belongs to a table or not, by
approximating its score as a linear function of the feature vector
$x^{(i)}$:

$$s_\theta(x^{(i)}) = \theta^T x^{(i)} = \theta_{NAM} x^{(i)}_{NAM} + \theta_{PTD} x^{(i)}_{PTD} + \theta_{NEP} x^{(i)}_{NEP} + \theta_0, \tag{1}$$

where $\theta$ represents the vector of parameters associated with
the features. Then the classifier applies the sigmoid function,
which maps the score into [0.0, 1.0], to give the probability that
the feature vector $x^{(i)}$ was extracted from a table:

$$\Pr(y^{(i)} = 1 \mid x^{(i)}) = \frac{1}{1 + e^{-s_\theta(x^{(i)})}}, \tag{2}$$

otherwise

$$\Pr(y^{(i)} = 0 \mid x^{(i)}) = 1 - \frac{1}{1 + e^{-s_\theta(x^{(i)})}}. \tag{3}$$


5. http://nlp.stanford.edu/software/tagger.shtml 

6. http://nlp.stanford.edu/software/CRF-NER.shtml 

7. To implement the classifier, we integrate LIBLINEAR
(http://liblinear.bwaldvogel.de/) into our system.


The objective is to estimate the best parameter vector $\theta$ by
maximizing the log-likelihood of all training examples:

$$\arg\max_{\theta} \log \prod_{i=1}^{n} \Pr(y^{(i)} \mid x^{(i)}). \tag{4}$$

• Support Vector Machine⁸ (SVM) builds on the linear combination
hypothesis illustrated by Equation (1), by defining the functional
margin $\gamma^{(i)}$ of a training example $(x^{(i)}, y^{(i)})$:

$$\gamma^{(i)} = y^{(i)} (w^T x^{(i)} + b), \tag{5}$$

where $y^{(i)} \in \{+1, -1\}$, $w$ is the vector of weights, and
$b$ is the bias. Among all of them, we use $\gamma'$ to denote the
minimum margin:

$$\gamma' = \min_{i = 1, \ldots, n} \gamma^{(i)}. \tag{6}$$

The objective, shown as follows,

$$\max_{\gamma', w, b} \frac{\gamma'}{\|w\|} \quad \text{s.t.} \quad y^{(i)} (w^T x^{(i)} + b) \geq \gamma', \quad i = 1, \ldots, n \tag{7}$$

results in a classifier that separates the positive and the
negative training examples with a “gap”.

• Naive Bayes⁹ (NB) differs from the two classifiers mentioned
above. Firstly, it requires discrete variables as features, so we
need to map $x_{NAM}$ and $x_{NEP}$, which are originally
continuous variables, into discrete variables¹⁰. Secondly, rather
than directly modeling $\Pr(y \mid x)$ as the two discriminative
models above do, Naive Bayes is a classical generative model that
attempts to describe the joint probability of x and y, i.e.
$\Pr(x, y)$:

$$\Pr(x, y) = \Pr(x \mid y) \Pr(y). \tag{8}$$

We derive $\Pr(y \mid x)$ from $\Pr(x, y)$ based on the Bayes rule,
and choose the value of y with the higher probability
$\Pr(y \mid x)$ as the prediction. Given a testing example
$x^{(t)}$, we use the following equation to predict the result:

$$\arg\max_{y} \Pr(y \mid x^{(t)}) = \arg\max_{y} \frac{\Pr(x^{(t)} \mid y) \Pr(y)}{\Pr(x^{(t)})} = \arg\max_{y} \Pr(x^{(t)} \mid y) \Pr(y). \tag{9}$$


8. LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/ is the well- 
known open-source software that can be leveraged by PdfExtra. 

9. We adopt https://github.com/ptnplanet/Java-Naive-Bayes-Classifier to 
implement the Naive Bayes classifier. 

10. $x_{NAM}$ generally ranges from 0.0 to 1.0. We set a step size
of 0.2 to map the continuous variables into discrete ones. For
example, all values with 0.0 < $x_{NAM}$ < 0.2 are mapped to the
same discrete value, and so on.






TABLE 1. Statistics of benchmark datasets for table region detection.

Dataset         # Training files   # Testing PDF files   # Training Lines   # Testing Lines
ICDAR 2013      50                 17                    804                224
ACL Anthology   9,280              186                   357,892            346

The key assumption of Naive Bayes is that all the features are
independent of each other given y, so that for an example
$x^{(t)}$:

$$\Pr(x^{(t)} \mid y) = \Pr(x^{(t)}_{NAM} \mid y) \Pr(x^{(t)}_{PTD} \mid y) \Pr(x^{(t)}_{NEP} \mid y). \tag{10}$$

Therefore, we believe it will behave differently from the other
classifiers.
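The discretization step and the Naive Bayes decision rule of Equations (8) to (10) can be illustrated with a small self-contained sketch. It is not the classifier library used by PdfExtra: the class and function names are hypothetical, the bin index returned by `discretize` is an assumed encoding of the 0.2 step, and add-one (Laplace) smoothing is an assumption added here so that unseen feature values do not produce zero probabilities.

```python
import math
from collections import Counter, defaultdict
from typing import Hashable, List, Sequence


def discretize(value: float, step: float = 0.2) -> int:
    """Map a value in [0, 1] to a bin index (0..4 for step 0.2)."""
    n_bins = round(1 / step)
    # The small epsilon guards against float error such as 0.6 / 0.2 < 3.
    return min(int(value / step + 1e-9), n_bins - 1)


class NaiveBayes:
    """Categorical Naive Bayes over tuples of discrete feature values."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha  # assumed Laplace smoothing constant

    def fit(self, X: Sequence[Sequence[Hashable]], y: Sequence[int]):
        self.class_counts = Counter(y)
        self.feature_counts = defaultdict(Counter)  # (label, position) -> counts
        self.values = defaultdict(set)              # position -> observed values
        for features, label in zip(X, y):
            for pos, value in enumerate(features):
                self.feature_counts[(label, pos)][value] += 1
                self.values[pos].add(value)
        return self

    def log_joint(self, features: Sequence[Hashable], label: int) -> float:
        """log Pr(x | y) + log Pr(y), with Pr(x | y) factorized per feature."""
        n = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / n)
        for pos, value in enumerate(features):
            num = self.feature_counts[(label, pos)][value] + self.alpha
            den = self.class_counts[label] + self.alpha * len(self.values[pos])
            score += math.log(num / den)
        return score

    def predict(self, features: Sequence[Hashable]) -> int:
        """Return the label with the highest joint probability."""
        return max(self.class_counts, key=lambda c: self.log_joint(features, c))
```

The denominator $\Pr(x^{(t)})$ of Equation (9) is never computed, because it is the same for both labels and does not change the arg max.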

4. Experiment 

We conduct experiments that compare our paradigm with the
state-of-the-art Heuristics approach [1] on two datasets, i.e. ACL
Anthology and ICDAR 2013 Table Competition, using standard metrics.

4.1. Datasets

We prepare two datasets from different domains. The dataset¹¹ of
the ICDAR 2013 Table Competition is the benchmark, which contains
67 ground-truth PDF documents from U.S. and E.U. governments. The
other dataset, ACL Anthology, is much larger and contains 9,466
academic articles from 2000 to 2015. It covers more than 10
top-tier conferences related to Computational Linguistics, such as
ACL, EMNLP, COLING and NAACL. Table 1 shows the statistics of the
ICDAR 2013 and ACL Anthology datasets for evaluation.

• ICDAR 2013: We divide the dataset into two parts. 75% of the
files (50 documents) are used as training examples, and 25% of the
files (17 documents) are reserved for testing. After the heuristic
annotation, 804 lines are left for training. We manually annotate
224 lines from the 17 testing documents for testing.

• ACL Anthology: The papers published in 2015 are held out, and we
label 346 lines from them as the ground-truth examples for testing.
In addition, we obtain 357,892 lines from 9,280 academic articles
as the weakly labeled training examples.

11. http://www.tamirhassan.com/files/icdar2013-competition-dataset-with-gt.zip

4.2. Metrics

Since we regard table region detection as a binary classification
problem, several standard metrics, namely Accuracy, Precision,
Recall and F1-measure, are adopted for evaluating the performance.
Each ground-truth line for testing is classified based on its
features, and the output label is +1 or -1. As shown in Figure 2,
every testing example falls into one of four cells: True Positive
(TP), False Positive (FP), False Negative (FN) or True Negative
(TN). For instance, if a system assigns the positive label (+1) to
a testing line whose ground truth is negative, that is a false
positive (FP).

                               Ground-truth
Prediction        POSITIVE (+1)          NEGATIVE (-1)
POSITIVE (+1)     True Positive (TP)     False Positive (FP)
NEGATIVE (-1)     False Negative (FN)    True Negative (TN)

Figure 2. 2 x 2 evaluation matrix for binary classification.

• Accuracy is a metric that measures the overall performance of
binary classification. It considers all the testing examples,
including the positives and the negatives, and indicates the
proportion of lines that are predicted correctly. Therefore,

$$\text{Accuracy} = \frac{\#(TP) + \#(TN)}{\#(TP) + \#(FP) + \#(FN) + \#(TN)}. \tag{11}$$

• Precision and Recall are a pair of metrics that focus on the
positive lines. Specifically, precision represents the proportion
of lines predicted as positive that are correct, i.e.,

$$\text{Precision} = \frac{\#(TP)}{\#(TP) + \#(FP)}, \tag{12}$$

and recall is the proportion of positive ground-truth examples
that are predicted as positive:

$$\text{Recall} = \frac{\#(TP)}{\#(TP) + \#(FN)}. \tag{13}$$

• F1-measure is a trade-off between precision and recall, which
measures the harmonic mean of the two metrics above:

$$\text{F1-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. \tag{14}$$
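As a worked check of Equations (11) to (14), the four metrics can be computed from the cells of the evaluation matrix. The function name is hypothetical. The counts in the usage note are not reported in the paper: they are inferred here from the Heuristics row on the 224 ICDAR 2013 testing lines, as the integer counts that reproduce its four reported values.

```python
from typing import Tuple


def evaluate(tp: int, fp: int, fn: int, tn: int) -> Tuple[float, float, float, float]:
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1
```

With the inferred counts TP = 44, FP = 30, FN = 71 and TN = 79 (224 lines in total), `evaluate(44, 30, 71, 79)` gives 123/224, 44/74, 44/115 and 88/189, which round to 0.5491, 0.5946, 0.3826 and 0.4656.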

4.3. Performances 

We use the four metrics above to measure the performance of our
system PdfExtra against the state-of-the-art Heuristics approach
[1]. Both are evaluated on two benchmark datasets, i.e. ICDAR 2013
and ACL Anthology. Tables 2 and 3 show the results of the
experiments: PdfExtra improves the F1-measure over the latest
approach from 0.4656 to 0.6122 on ICDAR 2013, and from 0.4444 to
0.8022 on ACL Anthology.

TABLE 2. Performance comparison on ICDAR 2013 testing set, fed by ICDAR 2013 training examples.

Approach      Accuracy   Precision   Recall   F1-measure
Heuristics    0.5491     0.5946      0.3826   0.4656
PdfExtra      0.6607     0.7407      0.5217   0.6122

TABLE 3. Performance comparison on ACL Anthology testing set, fed by ACL Anthology training examples.

Approach      Accuracy   Precision   Recall   F1-measure
Heuristics    0.5665     0.5660      0.3659   0.4444
PdfExtra      0.7948     0.7385      0.8780   0.8022

TABLE 4. Performance comparison with different combinations of features on ICDAR 2013 testing set, fed by ICDAR 2013 training examples.

Approach                      Accuracy   Precision   Recall   F1-measure
Heuristics                    0.5491     0.5946      0.3826   0.4656
PdfExtra (NAM)                0.5134     0.5134      1.0000   0.6785
PdfExtra (NAM + PTD)          0.7321     0.7835      0.6609   0.7170
PdfExtra (NAM + PTD + NEP)    0.6607     0.7407      0.5217   0.6122

TABLE 5. Performance comparison with different combinations of features on ACL Anthology testing set, fed by ACL Anthology training examples.

Approach                      Accuracy   Precision   Recall   F1-measure
Heuristics                    0.5665     0.5660      0.3659   0.4444
PdfExtra (NAM)                0.4740     0.4740      1.0000   0.6431
PdfExtra (NAM + PTD)          0.7312     0.6564      0.9085   0.7621
PdfExtra (NAM + PTD + NEP)    0.7948     0.7385      0.8780   0.8022

TABLE 6. Performance comparison with different classifiers on ICDAR 2013 testing set, fed by ICDAR 2013 training examples.

Approach          Accuracy   Precision   Recall   F1-measure
Heuristics        0.5491     0.5946      0.3826   0.4656
PdfExtra (NB)     0.7902     0.7464      0.8957   0.8142
PdfExtra (LR)     0.6071     0.6956      0.4174   0.5217
PdfExtra (SVM)    0.6607     0.7407      0.5217   0.6122
PdfExtra          0.6607     0.7407      0.5217   0.6122

TABLE 7. Performance comparison with different classifiers on ACL Anthology testing set, fed by ACL Anthology training examples.

Approach          Accuracy   Precision   Recall   F1-measure
Heuristics        0.5665     0.5660      0.3659   0.4444
PdfExtra (NB)     0.7861     0.7206      0.8963   0.7989
PdfExtra (LR)     0.7948     0.7385      0.8780   0.8022
PdfExtra (SVM)    0.7919     0.7347      0.8780   0.8000
PdfExtra          0.7948     0.7385      0.8780   0.8022

TABLE 8. Cross-domain performance comparison on ICDAR 2013 testing set, fed by ACL Anthology training examples.

Approach            Accuracy   Precision   Recall   F1-measure
Heuristics          0.5491     0.5946      0.3826   0.4656
PdfExtra (ICDAR)    0.6607     0.7407      0.5217   0.6122
PdfExtra (ACL)      0.7803     0.7683      0.7683   0.7683

5. Discussion 

To deeply analyze the paradigm we have proposed, we 
discuss the factors that may impact the performance from 
three perspectives: 

5.1. Impact of features 

We try different combinations of features: the layout feature
only (NAM), the layout with the part-of-speech feature (NAM + PTD),
and the combination of all the features (NAM + PTD + NEP). We keep
using the three classification models collaboratively to vote on
the final prediction. Both Table 4 and Table 5 demonstrate that the
pure layout feature does not perform well in detecting the table
region, as shown by the experimental results of the
state-of-the-art Heuristics [1] and PdfExtra (NAM). In both tables,
PdfExtra (NAM) reaches a recall of 1.0000 with precision equal to
accuracy, which indicates that it labels every testing line as a
table line. For the ICDAR 2013 dataset, the best feature
combination is NAM + PTD. On the ACL Anthology dataset, in
contrast, the combination NAM + PTD + NEP leads to the best
performance.

5.2. Impact of classifiers 

Besides the combinations of features, the three classifiers may
also perform differently, due to the different hypotheses behind
their mathematical models. Therefore, we map both the ICDAR 2013
and ACL Anthology datasets to the same feature combination
(NAM + PTD + NEP), and select one individual classifier at a time,
i.e. Naive Bayes (PdfExtra (NB)), Logistic Regression
(PdfExtra (LR)) or Support Vector Machine (PdfExtra (SVM)), to
compare with the voting version (PdfExtra). Tables 6 and 7 show the
performance on the ICDAR 2013 and ACL Anthology datasets
respectively. The Naive Bayes classifier behaves stably on the two
benchmark datasets, with F1-measures of 0.8142 and 0.7989.
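The voting version can be pictured with the sketch below. The exact voting rule is not specified in the text, so simple majority voting over the three per-line labels is an assumption here, and the two linear decision helpers are hypothetical stand-ins for the trained Logistic Regression and SVM models, with all labels mapped to +1 (table) and -1 (not table).

```python
import math
from collections import Counter
from typing import Sequence


def majority_vote(labels: Sequence[int]) -> int:
    """Return the label predicted by most classifiers."""
    return Counter(labels).most_common(1)[0][0]


def lr_label(theta: Sequence[float], x: Sequence[float], theta0: float = 0.0) -> int:
    """Logistic-regression style decision: sigmoid of a linear score."""
    score = sum(t * v for t, v in zip(theta, x)) + theta0
    probability = 1.0 / (1.0 + math.exp(-score))
    return 1 if probability >= 0.5 else -1


def svm_label(w: Sequence[float], x: Sequence[float], b: float = 0.0) -> int:
    """Linear SVM style decision: sign of w^T x + b."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1
```

With three voters and two labels a tie cannot occur, so the ensemble label always agrees with at least two of the individual classifiers.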

5.3. Impact of domains 

The most significant aspect of our new paradigm that needs to be
discussed is the evaluation on cross-domain datasets, which
directly reflects the generality of a paradigm. If it could only
outperform the state-of-the-art approaches when trained and tested
on PDF documents from the same specific domain, the paradigm would
still be a trial version that makes a minor contribution to
research on table region detection. Therefore, we set up an
experiment in which we feed the training examples of the ACL
Anthology dataset to our model, and test the performance on the
testing set of ICDAR 2013. As Table 8 shows, testing on the files
of ICDAR 2013 achieves performance comparable with testing on the
ACL Anthology dataset. Moreover, PdfExtra (ACL), trained on
academic articles, detects tables in government documents better
than PdfExtra (ICDAR), trained on in-domain documents (F1-measure
0.7683 vs. 0.6122). We attribute this ability to handle
cross-domain files to the fact that the features and classifiers we
leverage do not depend on specific layouts, or even on the
contents, of diverse PDF documents.

6. Conclusion 

In this paper, we have contributed a novel paradigm for detecting
the region of tables within PDF documents. It combines the
strengths of both supervised and unsupervised approaches, and is
the first to cover nearly ten thousand PDF documents in a different
domain. To be specific, it leverages different supervised learning
models to adapt to the various layouts and linguistic features of
tables in large-scale PDF files, but requires no manual labeling of
the training corpus. We integrate the paradigm into our system
PdfExtra, which enhances the off-the-shelf software Apache PDFBox
to predict whether a text line belongs to a table or not. Three
classification models have been evaluated: Logistic Regression,
Support Vector Machine and Naive Bayes. We find that Naive Bayes
gives stable predictions on both benchmark datasets, and that
linguistic features substantially improve the performance.
Moreover, our experiments show that the paradigm is robust for
table detection on open-domain PDF documents.

However, a weakly labeled paradigm cannot avoid bringing noise
into the training data, which affects the performance of the
system. In the future, we look forward to exploring the correlation
between tables within the same article to filter out such errors.